{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "spoken-reunion",
   "metadata": {},
   "source": [
    "# Sensitive Data Detection with the Labeler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "interesting-bidder",
   "metadata": {},
   "source": [
    "In this example, we utilize the Labeler component of the Data Profiler to detect the sensitive information for both structured and unstructured data. In addition, we show how to train the Labeler on some specific dataset with different list of entities.\n",
    "\n",
    "First, let's dive into what the Labeler is."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1965b83b",
   "metadata": {},
   "source": [
    "## What is the Labeler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "388c643f",
   "metadata": {},
   "source": [
    "The Labeler is a pipeline designed to make building, training, and predictions with ML models quick and easy. There are 3 major components to the Labeler: the preprocessor, the model, and the postprocessor."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5d0aeb4",
   "metadata": {},
   "source": [
    "![alt text](DL-Flowchart.png \"Title\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "550323c7",
   "metadata": {},
   "source": [
    "Each component can be switched out individually to suit your needs. As you might expect, the preprocessor takes in raw data and prepares it for the model, the model performs the prediction or training, and the postprocessor takes prediction results and turns them into human-readable results. \n",
    "\n",
    "Now let's run some examples. Start by importing all the requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "scientific-stevens",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "try:\n",
    "    sys.path.insert(0, '..')\n",
    "    import dataprofiler as dp\n",
    "except ImportError:\n",
    "    import dataprofiler as dp\n",
    "\n",
    "# remove extra tf loggin\n",
    "import tensorflow as tf\n",
    "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5125b215",
   "metadata": {},
   "source": [
    "## Structured Data Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "wicked-devon",
   "metadata": {},
   "source": [
    "We'll use the aws honeypot dataset in the test folder for this example. First, look at the data using the Data Reader class of the Data Profiler. This dataset is from the US department of educations, [found here!](https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources?resource=823ac095-bdfc-41b0-b508-4e8fc3110082)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adjusted-native",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data = dp.Data(\"../dataprofiler/tests/data/csv/SchoolDataSmall.csv\")\n",
    "df_data = data.data\n",
    "df_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab6ccf8a",
   "metadata": {},
   "source": [
    "We can directly predict the labels of a structured dataset on the cell level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19529af4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "labeler = dp.DataLabeler(labeler_type='structured')\n",
    "\n",
    "# print out the labels and label mapping\n",
    "print(\"Labels: {}\".format(labeler.labels)) \n",
    "print(\"\\n\")\n",
    "print(\"Label Mapping: {}\".format(labeler.label_mapping))\n",
    "print(\"\\n\")\n",
    "\n",
    "# make predictions and get labels for each cell going row by row\n",
    "# predict options are model dependent and the default model can show prediction confidences\n",
    "predictions = labeler.predict(data, predict_options={\"show_confidences\": True})\n",
    "\n",
    "# display prediction results\n",
    "print(\"Predictions: {}\".format(predictions['pred']))\n",
    "print(\"\\n\")\n",
    "\n",
    "# display confidence results\n",
    "print(\"Confidences: {}\".format(predictions['conf']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2af72e2c",
   "metadata": {},
   "source": [
    "The profiler uses the Labeler to perform column by column predictions. The data contains 11 columns, each of which has  data label. Next, we will use the Labeler of the Data Profiler to predict the label for each column in this tabular dataset. Since we are only going to demo the labeling functionality, other options of the Data Profiler are disabled to keep this quick."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6cb9d7e-149a-4cfe-86f8-76c47c57aeea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# helper functions for printing results\n",
    "\n",
    "def get_structured_results(results):\n",
    "    \"\"\"Helper function to get data labels for each column.\"\"\"\n",
    "    columns = []\n",
    "    predictions = []\n",
    "    samples = []\n",
    "    for col in results['data_stats']:\n",
    "        columns.append(col['column_name'])\n",
    "        predictions.append(col['data_label'])\n",
    "        samples.append(col['samples'])\n",
    "\n",
    "    df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions, 'Sample': samples})\n",
    "    return df_results\n",
    "\n",
    "def get_unstructured_results(data, results):\n",
    "    \"\"\"Helper function to get data labels for each labeled piece of text.\"\"\"\n",
    "    labeled_data = []\n",
    "    for pred in results['pred'][0]:\n",
    "        labeled_data.append([data[0][pred[0]:pred[1]], pred[2]])\n",
    "    label_df = pd.DataFrame(labeled_data, columns=['Text', 'Labels'])\n",
    "    return label_df\n",
    "    \n",
    "\n",
    "pd.set_option('display.width', 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "secret-million",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# set options to only run the labeler\n",
    "profile_options = dp.ProfilerOptions()\n",
    "profile_options.set({\"structured_options.text.is_enabled\": False, \n",
    "                     \"int.is_enabled\": False, \n",
    "                     \"float.is_enabled\": False, \n",
    "                     \"order.is_enabled\": False, \n",
    "                     \"category.is_enabled\": False, \n",
    "                     \"chi2_homogeneity.is_enabled\": False,\n",
    "                     \"datetime.is_enabled\": False,})\n",
    "\n",
    "profile = dp.Profiler(data, options=profile_options)\n",
    "\n",
    "results = profile.report()    \n",
    "print(get_structured_results(results))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fatty-louisville",
   "metadata": {},
   "source": [
    "In this example, the results show that the Data Profiler is able to detect integers, URLs, address, and floats appropriately. Unknown is typically strings of text, which is appropriate for those columns."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "unavailable-diploma",
   "metadata": {},
   "source": [
    "## Unstructured Data Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "metallic-coaching",
   "metadata": {},
   "source": [
    "Besides structured data, the Labeler detects the sensitive information on the unstructured text. We use a sample of spam email in Enron email dataset for this demo. As above, we start investigating the content of the given email sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "unauthorized-lounge",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# load data\n",
    "data = \"Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>\\n\" + \\\n",
    "        \"Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)\\n\" + \\\n",
    "        \"From: w..smith@company.com\\n\" + \\\n",
    "        \"To: john.smith@company.com\\n\" + \\\n",
    "        \"Subject: RE: ABC\\n\" + \\\n",
    "        \"Mime-Version: 1.0\\n\" + \\\n",
    "        \"Content-Type: text/plain; charset=us-ascii\\n\" + \\\n",
    "        \"Content-Transfer-Encoding: 7bit\\n\" + \\\n",
    "        \"X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>\\n\" + \\\n",
    "        \"X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>\\n\" + \\\n",
    "        \"X-cc: \\n\" + \\\n",
    "        \"X-bcc: \\n\" + \\\n",
    "        \"X-Folder: \\SSMITH (Non-Privileged)\\Sent Items\\n\" + \\\n",
    "        \"X-Origin: Smith-S\\n\" + \\\n",
    "        \"X-FileName: SSMITH (Non-Privileged).pst\\n\\n\" + \\\n",
    "        \"All I ever saw was the e-mail from the office.\\n\\n\" + \\\n",
    "        \"Mary\\n\\n\" + \\\n",
    "        \"-----Original Message-----\\n\" + \\\n",
    "        \"From:   Smith, John  \\n\" + \\\n",
    "        \"Sent:   Friday, August 10, 2005 13:07 PM\\n\" + \\\n",
    "        \"To:     Smith, Mary W.\\n\" + \\\n",
    "        \"Subject:        ABC\\n\\n\" + \\\n",
    "        \"Have you heard any more regarding the ABC sale? I guess that means that \" + \\\n",
    "        \"it's no big deal here, but you think they would have send something.\\n\\n\\n\" + \\\n",
    "        \"John Smith\\n\" + \\\n",
    "        \"123-456-7890\\n\"\n",
    "\n",
    "# convert string data to list to feed into the labeler\n",
    "data = [data]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "concerned-segment",
   "metadata": {},
   "source": [
    "By default, the Labeler predicts the results at the character level for unstructured text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "junior-acrobat",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "labeler = dp.DataLabeler(labeler_type='unstructured')\n",
    "\n",
    "# make predictions and get labels per character\n",
    "predictions = labeler.predict(data)\n",
    "\n",
    "# display results\n",
    "print(predictions['pred'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "individual-diabetes",
   "metadata": {},
   "source": [
    "In addition to the character-level result, the Labeler provides the results at the word level following the standard NER (Named Entity Recognition), e.g., utilized by spaCy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "optical-universe",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# convert prediction to word format and ner format\n",
    "# Set the output to the NER format (start position, end position, label)\n",
    "labeler.set_params(\n",
    "    { 'postprocessor': { 'output_format':'ner', 'use_word_level_argmax':True } } \n",
    ")\n",
    "\n",
    "# make predictions and get labels per character\n",
    "predictions = labeler.predict(data)\n",
    "\n",
    "# display results\n",
    "print('\\n')\n",
    "print('=======================Prediction======================\\n')\n",
    "for pred in predictions['pred'][0]:\n",
    "    print('{}: {}'.format(data[0][pred[0]: pred[1]], pred[2]))\n",
    "    print('--------------------------------------------------------')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "behavioral-tourism",
   "metadata": {},
   "source": [
    "Here, the Labeler is able to identify sensitive information such as datetime, email address, person names, and phone number in an email sample. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "nasty-disney",
   "metadata": {},
   "source": [
    "## Train the Labeler from Scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "destroyed-twist",
   "metadata": {},
   "source": [
    "The Labeler can be trained from scratch with a new list of labels. Below, we show an example of training the Labeler on a dataset with labels given as the columns of that dataset. For brevity's sake, let's only train a few epochs with a subset of a dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "utility-evaluation",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data = dp.Data(\"../dataprofiler/tests/data/csv/SchoolDataSmall.csv\")\n",
    "df = data.data[[\"OPEID6\", \"INSTURL\", \"SEARCH_STRING\"]]\n",
    "df.head()\n",
    "\n",
    "# split data to training and test set\n",
    "split_ratio = 0.2\n",
    "df = df.sample(frac=1).reset_index(drop=True)\n",
    "data_train = df[:int((1 - split_ratio) * len(df))]\n",
    "data_test = df[int((1 - split_ratio) * len(df)):]\n",
    "\n",
    "# train a new labeler with column names as labels\n",
    "if not os.path.exists('data_labeler_saved'):\n",
    "    os.makedirs('data_labeler_saved')\n",
    "\n",
    "labeler = dp.train_structured_labeler(\n",
    "    data=data_train,\n",
    "    save_dirpath=\"data_labeler_saved\",\n",
    "    epochs=10,\n",
    "    default_label=\"OPEID6\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "utility-torture",
   "metadata": {},
   "source": [
    "The trained Labeler is then used by the Data Profiler to provide the prediction on the new dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "answering-panel",
   "metadata": {},
   "outputs": [],
   "source": [
    "# predict with the labeler object\n",
    "profile_options.set({'structured_options.data_labeler.data_labeler_object': labeler})\n",
    "profile = dp.Profiler(data_test, options=profile_options)\n",
    "\n",
    "# get the prediction from the data profiler\n",
    "results = profile.report()\n",
    "print(get_structured_results(results))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "polish-stand",
   "metadata": {},
   "source": [
    "Another way to use the trained Labeler is through the directory path of the saved labeler."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "industrial-characterization",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# predict with the labeler loaded from path\n",
    "profile_options.set({'structured_options.data_labeler.data_labeler_dirpath': 'data_labeler_saved'})\n",
    "profile = dp.Profiler(data_test, options=profile_options)\n",
    "\n",
    "# get the prediction from the data profiler\n",
    "results = profile.report()\n",
    "print(get_structured_results(results))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2acedba0",
   "metadata": {},
   "source": [
    "## Transfer Learning a Labeler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f15fb1f",
   "metadata": {},
   "source": [
    "Instead of training a model from scratch, we can also transfer learn to improve the model and/or extend the labels. Again for brevity's sake, let's only train a few epochs with a small dataset at the cost of accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0104c374",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data = dp.Data(\"../dataprofiler/tests/data/csv/SchoolDataSmall.csv\")\n",
    "df_data = data.data[[\"OPEID6\", \"INSTURL\", \"SEARCH_STRING\"]]\n",
    "\n",
    "\n",
    "# prep data\n",
    "df_data = df_data.reset_index(drop=True).melt()\n",
    "df_data.columns = [1, 0]  # labels=1, values=0 in that order\n",
    "df_data = df_data.astype(str)\n",
    "new_labels = df_data[1].unique().tolist()\n",
    "\n",
    "# load structured Labeler w/ trainable set to True\n",
    "labeler = dp.DataLabeler(labeler_type='structured', trainable=True)\n",
    "\n",
    "# Reconstruct the model to add each new label\n",
    "for label in new_labels:\n",
    "    labeler.add_label(label)\n",
    "\n",
    "# this will use transfer learning to retrain the labeler on your new\n",
    "# dataset and labels.\n",
    "# Setting labels with a list of labels or label mapping will overwrite the existing labels with new ones\n",
    "# Setting the reset_weights parameter to false allows transfer learning to occur\n",
    "model_results = labeler.fit(x=df_data[0], y=df_data[1], validation_split=0.2, \n",
    "                                 epochs=10, labels=None, reset_weights=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae78745f",
   "metadata": {},
   "source": [
    "Let's display the training results of the last epoch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b764aa8c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"{:16s}  Precision  Recall  F1-score  Support\".format(\"\"))\n",
    "for item in model_results[-1][2]:\n",
    "    print(\"{:16s}  {:4.3f}      {:4.3f}   {:4.3f}     {:7.0f}\".format(item,\n",
    "                                                                      model_results[-1][2][item][\"precision\"],\n",
    "                                                                      model_results[-1][2][item][\"recall\"],\n",
    "                                                                      model_results[-1][2][item][\"f1-score\"],\n",
    "                                                                      model_results[-1][2][item][\"support\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44009522",
   "metadata": {},
   "source": [
    "It is now trained to detect additional labels! The model results here show all the labels training accuracy. Since only new labels existed in the dataset, only the new labels are given accuracy scores. Keep in mind this is a small dataset for brevity's sake and that real training would involve more samples and better results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e110ee1c",
   "metadata": {},
   "source": [
    "## Saving and Loading a Labeler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c484d193",
   "metadata": {},
   "source": [
    "The Labeler can easily be saved or loaded with one simple line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d8684fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ensure save directory exists\n",
    "if not os.path.exists('my_labeler'):\n",
    "    os.makedirs('my_labeler')\n",
    "\n",
    "# Saving the labeler\n",
    "labeler.save_to_disk(\"my_labeler\")\n",
    "\n",
    "# Loading the labeler\n",
    "labeler = dp.DataLabeler(labeler_type='structured', dirpath=\"my_labeler\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d36dec8",
   "metadata": {},
   "source": [
    "## Building a Labeler from the Ground Up"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59346d2b",
   "metadata": {},
   "source": [
    "As mentioned earlier, the labeler is comprised of three components, and each of the compenents can be created and interchanged in the the labeler pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6506ef97",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "from dataprofiler.labelers.character_level_cnn_model import \\\n",
    "    CharacterLevelCnnModel\n",
    "from dataprofiler.labelers.data_processing import \\\n",
    "    StructCharPreprocessor, StructCharPostprocessor\n",
    "\n",
    "model = CharacterLevelCnnModel({\"PAD\":0, \"UNKNOWN\":1, \"Test_Label\":2})\n",
    "preprocessor = StructCharPreprocessor()\n",
    "postprocessor = StructCharPostprocessor()\n",
    "\n",
    "labeler = dp.DataLabeler(labeler_type='structured')\n",
    "labeler.set_preprocessor(preprocessor)\n",
    "labeler.set_model(model)\n",
    "labeler.set_postprocessor(postprocessor)\n",
    "\n",
    "# check for basic compatibility between the processors and the model\n",
    "labeler.check_pipeline()\n",
    "\n",
    "# Optionally set the parameters\n",
    "parameters={\n",
    "    'preprocessor':{\n",
    "        'max_length': 100,\n",
    "    },\n",
    "    'model':{\n",
    "        'max_length': 100,\n",
    "    },\n",
    "    'postprocessor':{\n",
    "        'random_state': random.Random(1)\n",
    "    }\n",
    "} \n",
    "labeler.set_params(parameters)\n",
    "\n",
    "labeler.help()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f020d7f",
   "metadata": {},
   "source": [
    "The components can each be created if you inherit the BaseModel and BaseProcessor for the model and processors, respectively. More info can be found about coding your own components in the Labeler section of the [documentation]( https://capitalone.github.io/dataprofiler). In summary, the Data Profiler open source library can be used to scan sensitive information in both structured and unstructured data with different file types. It supports multiple input formats and output formats at word and character levels. Users can also train the labeler on their own datasets."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
